MAXIMAL MARGIN HYPERPLANE


f(x,w,b) = sign(w.x + b)
(Figure legend: one marker denotes the +1 class, the other denotes the -1 class.)


Define the margin of a 
linear classifier as the 
width that the boundary 
could be increased by 
before hitting a 
datapoint. 


LINEAR CLASSIFIERS


The maximum margin linear classifier is the
linear classifier with the maximum margin.
This is the simplest kind of SVM (called an LSVM).


Support Vectors are 
those datapoints that 
the margin pushes up 
against 
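
The two definitions above can be made concrete with a small sketch. It assumes the boundary w.x + b = 0 already separates the classes, so only distances matter; the function name, the tolerance `tol`, and the four sample points are illustrative and not part of the slides.

```python
import math

def margin_and_support_vectors(w, b, points, tol=1e-9):
    """Return (margin_width, support_vectors) for the boundary w.x + b = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    # Perpendicular distance from each point to the boundary.
    distances = [abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w
                 for x in points]
    closest = min(distances)
    # Points at the minimum distance are the ones the margin pushes up against.
    support = [x for x, d in zip(points, distances) if d - closest <= tol]
    # The boundary can widen by `closest` on each side before hitting a point.
    return 2 * closest, support

# Hypothetical data: two points on each side of the vertical line x1 = 0.
points = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-4.0, 2.0)]
width, support = margin_and_support_vectors((1.0, 0.0), 0.0, points)
print(width, support)
```

Here the two points nearest the boundary, one from each side, come back as the support vectors; moving any other point slightly would leave the margin unchanged.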


SPECIFYING A LINE AND MARGIN
= How do we represent this mathematically?


= ...in m input dimensions? 


SPECIFYING A LINE AND MARGIN
= Plus-plane = {x : w.x + b = +1}
= Minus-plane = {x : w.x + b = -1}


Classify as
  +1 if w.x + b >= 1
  -1 if w.x + b <= -1
  Universe explodes (i.e. no datapoint is allowed in this band) if -1 < w.x + b < 1


= The vector w is perpendicular to the Plus Plane
= Let x- be any point on the minus plane (any location in R^m: not necessarily a datapoint)
= Let x+ be the closest plus-plane-point to x-


M = Margin Width


What we know:
= w.x+ + b = +1
= w.x- + b = -1
= x+ = x- + λw
= |x+ - x-| = M


It's now easy to get M in
terms of w and b
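
The step left to the reader can be filled in as follows. This is a sketch of the standard argument using only the four facts listed under "What we know"; λ is the scalar that moves x- along w onto the plus plane.

```latex
\begin{align*}
w \cdot (x^{-} + \lambda w) + b &= +1
  && \text{substitute } x^{+} = x^{-} + \lambda w \\
(w \cdot x^{-} + b) + \lambda\, w \cdot w &= +1 \\
-1 + \lambda\, w \cdot w &= +1
  && \text{since } w \cdot x^{-} + b = -1 \\
\lambda &= \frac{2}{w \cdot w} \\
M = \lVert x^{+} - x^{-} \rVert = \lVert \lambda w \rVert
  &= \lambda \sqrt{w \cdot w} = \frac{2}{\sqrt{w \cdot w}}
\end{align*}
```

So M depends on w alone: changing b shifts both planes together without changing the distance between them.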


LEARNING THE MAXIMUM MARGIN


Given a guess of w and b we can
= Compute whether all data points are in the correct half-planes
= Compute the width of the margin


So now we just need to write a program to search the space of w's 
and b's to find the widest margin that matches all the datapoints. 
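
The "write a program to search" idea can be sketched very literally. The code below is a deliberately naive random search, not how SVMs are trained in practice; it assumes the margin width for a valid (w, b) is 2/|w| under the plus-plane and minus-plane convention, and the function names, sampling range, trial count, and data are all illustrative choices.

```python
import math
import random

def margin_width(w, b, data):
    """Return 2/|w| if every (x, y) satisfies y * (w.x + b) >= 1, else None."""
    for x, y in data:
        if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 1:
            return None  # a point is misclassified or lies inside the margin
    return 2 / math.sqrt(sum(wi * wi for wi in w))

def random_search(data, dim, trials=20000, seed=0):
    """Naive search: sample (w, b) at random and keep the widest valid margin."""
    rng = random.Random(seed)
    best = (0.0, None, None)
    for _ in range(trials):
        w = [rng.uniform(-2, 2) for _ in range(dim)]
        b = rng.uniform(-2, 2)
        width = margin_width(w, b, data)
        if width is not None and width > best[0]:
            best = (width, w, b)
    return best

# Hypothetical linearly separable data: (point, label) pairs.
data = [((2.0, 0.0), 1), ((3.0, 1.0), 1), ((-2.0, 0.0), -1), ((-4.0, 2.0), -1)]
width, w, b = random_search(data, dim=2)
print(width, w, b)
```

For this data the widest possible margin is 4 (attained at w = (0.5, 0), b = 0), and the random search only approaches it from below, which is one reason a direct optimization formulation is preferable.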


THE MATH 


= Training instances
  = x ∈ R^n
  = y ∈ {-1, 1}
= Decision function
  = f(x) = sign(<w,x> + b)
  = w ∈ R^n
  = b ∈ R
= Find w and b that
  = Perfectly classify training instances
    = Assuming linear separability
  = Maximize margin


THE MATH 


= For perfect classification, we want
  = y_i (<w,x_i> + b) > 0 for all i
  = Why?

= To maximize the margin, we want
  = w that minimizes |w|^2
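
One way to answer the "Why?" and to connect the two goals is sketched below. It assumes the usual scaling in which the closest points satisfy <w,x> + b = ±1, so that the margin is 2/|w|; the factor 1/2 in the objective is a common convention rather than something the slides state.

```latex
\begin{align*}
\text{correct on } x_i
  &\iff \operatorname{sign}(\langle w, x_i \rangle + b) = y_i
  \iff y_i\,(\langle w, x_i \rangle + b) > 0 \\[4pt]
\min_{w,\,b}\ & \tfrac{1}{2}\lVert w \rVert^{2}
  \quad \text{subject to} \quad
  y_i\,(\langle w, x_i \rangle + b) \ge 1 \ \text{ for all } i
\end{align*}
```

Since y_i is either +1 or -1, the product is positive exactly when the sign of <w,x_i> + b matches the label. A margin of 2/|w| grows as |w| shrinks, so maximizing the margin amounts to minimizing |w|^2 while every point stays on or outside its own plane.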


STRENGTHS OF SVMS 


= Good generalization in theory 
= Good generalization in practice 
= Work well with few training instances 


= Efficient algorithms 


WHAT IF SURFACE IS NON-LINEAR? 


WHEN LINEAR SEPARATORS FAIL


MAPPING INTO A NEW FEATURE SPACE


(Figure panels: Input Space, Feature Space)


Φ: x → X = Φ(x)
Φ(x1, x2) = (x1^2, x2^2, √2 x1 x2)
= Rather than run SVM on x, run it on Φ(x)
= Find non-linear separator in input space


= What if Φ(x) is really big?
= Use kernels to compute it implicitly!
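
A small sketch of why the mapping helps, assuming the quadratic map Φ(x1, x2) = (x1^2, x2^2, √2 x1 x2) that the polynomial kernel slide uses. The data and the hand-picked w and b are hypothetical: points inside and outside the unit circle cannot be split by a line in input space, but a plane separates them in feature space.

```python
import math

def phi(x):
    """Quadratic feature map from R^2 to R^3 (assumed for this example)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.3)]  # class -1: inside the unit circle
outer = [(2.0, 0.0), (0.0, -2.0), (1.5, 1.5)]   # class +1: outside the unit circle

# Hand-picked linear classifier in feature space:
# w.Phi(x) + b = x1^2 + x2^2 - 1, which is a circle back in input space.
w, b = (1.0, 1.0, 0.0), -1.0

def classify(x):
    z = phi(x)
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) + b > 0 else -1

print([classify(x) for x in inner], [classify(x) for x in outer])
```

The separator is linear in Φ(x) yet non-linear in x, which is exactly the effect the slide describes.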


KERNELS 


= Find kernel K such that
  = K(x1,x2) = <Φ(x1), Φ(x2)>


= Computing K(x1,x2) should be efficient, much more so than computing Φ(x1) and Φ(x2)


= Use K(x1,x2) in SVM algorithm rather than <x1,x2>


= Remarkably, this is possible 


THE POLYNOMIAL KERNEL


= K(x1,x2) = <x1,x2>^2
  = x1 = (x11, x12)
  = x2 = (x21, x22)
= <x1,x2> = (x11 x21 + x12 x22)
= <x1,x2>^2 = (x11^2 x21^2 + x12^2 x22^2 + 2 x11 x12 x21 x22)
  = Φ(x1) = (x11^2, x12^2, √2 x11 x12)
  = Φ(x2) = (x21^2, x22^2, √2 x21 x22)


= K(x1,x2) = <Φ(x1), Φ(x2)>
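
The identity on this slide is easy to check numerically. The sketch below compares the explicit route (map both points, then take the dot product) with the implicit route (dot product first, then square); the helper names and the two sample points are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

def poly2_kernel(x1, x2):
    """Implicit computation: never builds the feature vectors."""
    return dot(x1, x2) ** 2

x1, x2 = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x1), phi(x2))
implicit = poly2_kernel(x1, x2)
print(explicit, implicit)
```

Both routes agree up to floating-point rounding, but the implicit one needs only a dot product in the original 2-D space.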


THE POLYNOMIAL KERNEL 


= Φ(x) contains all monomials of degree d
= Useful in visual pattern recognition
  = Number of monomials
    = 16x16 pixel image
    = 10^10 monomials of degree 5
  = Never explicitly compute Φ(x)!


= Variation - K(x1,x2) = (<x1,x2> + 1)^2
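
The monomial count can be reproduced with a one-line calculation. This sketch assumes monomials are counted without regard to the order of their factors, which is the counting that gives a figure on the order of 10^10; the function name is illustrative.

```python
import math

def num_monomials(n_inputs, degree):
    """Count distinct monomials of exactly `degree` in `n_inputs` variables."""
    # Choosing `degree` factors from `n_inputs` variables, repetition allowed.
    return math.comb(n_inputs + degree - 1, degree)

n = num_monomials(16 * 16, 5)  # 256 pixel inputs, degree 5
print(n)
```

The result is roughly 10^10 features for a single small image, while the kernel itself costs one 256-dimensional dot product and a power.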


A FEW GOOD KERNELS


Dot product kernel
= K(x1,x2) = <x1,x2>


Polynomial kernel
= K(x1,x2) = <x1,x2>^d (Monomials of degree d)
= K(x1,x2) = (<x1,x2> + 1)^d (All monomials of degree 1,2,...,d)


Gaussian kernel
= K(x1,x2) = exp(-|x1 - x2|^2 / 2σ^2)
= Radial basis functions


Sigmoid kernel
= K(x1,x2) = tanh(<x1,x2> + θ)
= Neural networks


Establishing "kernel-hood" from first principles is non-trivial 
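
The kernels listed above can be written down directly. This is a minimal sketch in plain Python; the default values for `d`, `sigma`, and `theta`, the `inhomogeneous` flag, and the `gram_matrix` helper are assumptions made for illustration. A symmetric Gram matrix is only a necessary property of a kernel, so the code does not establish kernel-hood.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dot_product_kernel(x1, x2):
    return dot(x1, x2)

def polynomial_kernel(x1, x2, d=2, inhomogeneous=False):
    # inhomogeneous=True gives the (<x1,x2> + 1)^d variant.
    base = dot(x1, x2) + (1.0 if inhomogeneous else 0.0)
    return base ** d

def gaussian_kernel(x1, x2, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def sigmoid_kernel(x1, x2, theta=0.0):
    return math.tanh(dot(x1, x2) + theta)

def gram_matrix(kernel, points):
    """Matrix of kernel values between every pair of points."""
    return [[kernel(p, q) for q in points] for p in points]

points = [(0.0, 1.0), (1.0, 1.0), (2.0, 0.0)]
G = gram_matrix(gaussian_kernel, points)
print(G)
```

Any of these functions can be passed wherever an algorithm expects a similarity between two inputs, which is what makes them interchangeable.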


THE KERNEL TRICK 


“Given an algorithm which is
formulated in terms of a positive
definite kernel K1, one can construct
an alternative algorithm by replacing
K1 with another positive definite
kernel K2”
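
As one concrete illustration of the quoted statement, here is a sketch using a dual-form perceptron rather than an SVM, chosen only because it is short. The training inputs appear solely inside `kernel(...)`, so swapping the kernel changes the algorithm without touching its code. The function names and the XOR-style data are illustrative.

```python
def kernel_perceptron(data, kernel, epochs=100):
    """Dual-form perceptron: inputs are only ever used inside kernel(xi, x)."""
    alpha = [0] * len(data)  # number of mistakes made on each training point

    def score(x):
        return sum(a * y * kernel(xi, x) for a, (xi, y) in zip(alpha, data))

    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(data):
            if y * score(x) <= 0:  # wrong side (or on the boundary)
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break  # every training point is classified correctly
    return lambda x: 1 if score(x) > 0 else -1

def linear(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly2(u, v):
    return (linear(u, v) + 1) ** 2

# XOR-style labels: not separable by any line in input space.
xor = [((1, 1), -1), ((-1, -1), -1), ((1, -1), 1), ((-1, 1), 1)]
labels = [y for _, y in xor]

predict_poly = kernel_perceptron(xor, poly2)
predict_linear = kernel_perceptron(xor, linear)
print([predict_poly(x) for x, _ in xor])
print([predict_linear(x) for x, _ in xor])
```

With the degree-2 polynomial kernel the same loop fits the XOR labels, while with the plain dot product it cannot.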


> SVMs can use the kernel trick 


EXOTIC KERNELS 


= Strings 

= Trees 

= Graphs 

= The hard part is establishing kernel-hood 
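
For strings, one simple example is a substring-count kernel (often called a spectrum kernel). It is used here as an assumed illustration rather than something the slides name. Kernel-hood is easy in this particular case because the kernel is, by construction, an inner product of explicit count vectors; the substring length `p` and the sample words are arbitrary.

```python
from collections import Counter

def spectrum_features(s, p):
    """Count every length-p substring of s (the explicit feature map)."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def spectrum_kernel(s, t, p=2):
    """Inner product of the substring-count vectors of s and t."""
    fs, ft = spectrum_features(s, p), spectrum_features(t, p)
    # Counter returns 0 for substrings that never occur in t.
    return sum(count * ft[sub] for sub, count in fs.items())

print(spectrum_kernel("banana", "bandana"))
print(spectrum_kernel("abc", "xyz"))
```

Strings that share many short substrings score high, and strings with none in common score zero, so the kernel behaves like a similarity measure that an SVM can use directly.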


CONCLUSION 


SVMs find optimal linear separator 


The kernel trick makes SVMs non-linear learning algorithms 


THANK YOU